75 research outputs found

    An analysis of local and global solutions to address Big Data imbalanced classification: a case study with SMOTE preprocessing

    Addressing the huge amount of data continuously generated is an important challenge in the Machine Learning field, and the need to adapt traditional techniques or create new ones is evident. To do so, distributed technologies must be used to deal with the significant scalability constraints of the Big Data context. In many Big Data classification applications, some classes are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise. Consequently, preprocessing techniques that balance the class distribution were developed. This can be achieved by suppressing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood size and, especially, the type of distributed design (local vs. global).
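The SMOTE procedure described above can be sketched as follows. This is a minimal, single-machine illustration of the core idea — interpolating between a minority instance and one of its k nearest neighbors — not the local/global distributed designs the study compares; the function name and parameters are assumptions for illustration:

```python
import random

def smote(minority, k=5, oversampling=100, seed=0):
    """Minimal SMOTE sketch: create synthetic minority samples by
    interpolating between an instance and one of its k nearest
    neighbors. `oversampling` is the percentage of synthetic samples
    relative to the minority class size."""
    rng = random.Random(seed)
    n_new = len(minority) * oversampling // 100
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance,
        # excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority instances, the quality of the result depends on which neighbors are reachable — exactly the property that differs between a local design (neighbors searched within each data partition) and a global one (neighbors searched over the whole minority class).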

    An insight into imbalanced Big Data classification: outcomes and challenges

    Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the high advantages related to knowledge extraction from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way, suited to commodity hardware. Being still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning required to fit the MapReduce programming style. This paper is designed under three main pillars: first, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area; second, to analyze the behavior of standard pre-processing techniques in this particular framework; finally, taking into account the experimental results obtained throughout this work, to carry out a discussion on the challenges and future directions for the topic.

    This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
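The "divide-and-conquer" partitioning that MapReduce performs, and why it accentuates the lack-of-data problem for minority classes, can be sketched with a toy map/reduce pass over an imbalanced dataset. The partitioning and labels below are hypothetical, purely for illustration:

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    # local ("map") step: count class labels inside one data split
    return Counter(label for _, label in partition)

def reduce_phase(partial_counts):
    # "reduce" step: merge the partial counts from every split
    return reduce(lambda a, b: a + b, partial_counts, Counter())

# An imbalanced dataset split across three partitions, as a MapReduce
# runtime would distribute it over commodity nodes.
partitions = [
    [((0.1,), "maj"), ((0.2,), "maj"), ((0.3,), "min")],
    [((0.4,), "maj"), ((0.5,), "maj")],  # this split holds NO minority data
    [((0.6,), "maj"), ((0.7,), "maj"), ((0.8,), "min")],
]
total = reduce_phase(map_phase(p) for p in partitions)
```

Note that the second partition contains no minority instances at all: any per-partition learner or pre-processing step running on that split sees an even more extreme (in fact, degenerate) imbalance than the global distribution, which is the "accentuated during data partitioning" effect the abstract refers to.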

    Mining DEOPS records: big data's insights into dictatorship

    Historical data provide valuable information for the understanding of human interactions through time. However, mining these data is challenging, as the available records are generally noisy digitized handwritten, typewritten, or press-printed documents. In this research proposal, we plan to develop tools and techniques for pre-processing and extracting information from documents of the military dictatorship that ruled Brazil from 1964 to 1985. The data to be analyzed consist of digitized images of records from DEOPS/SP (São Paulo State Department of Political and Social Order), an emblematic police agency which monitored (and in some cases harassed and tortured) hundreds of thousands of Brazilian citizens during that period. The idea is to use state-of-the-art artificial intelligence algorithms in conjunction with crowdsourcing techniques to pre-process and extract information from this important period of Brazilian history.

    F-Measure Curves for Visualizing Classifier Performance with Imbalanced Data

    Training classifiers using imbalanced data is a challenging problem in many real-world recognition applications, due in part to the bias in performance that occurs when: (1) classifiers are optimized and compared using performance measurements that are unsuitable for imbalance problems; (2) classifiers are trained and tested on a fixed imbalance level of data, which may differ from operational scenarios; (3) the preference for correct classification of each class is application dependent. Specialized performance evaluation metrics and tools are needed for problems that involve class imbalance, including scalar metrics that assume a given operating condition (skew level and relative preference of classes), and global evaluation curves or metrics that consider a range of operating conditions. We propose a global evaluation space for the scalar F-measure metric that is analogous to the cost curves for expected cost. In this space, a classifier is represented as a curve that shows its performance over all of its decision thresholds and a range of imbalance levels, for the desired preference of true positive rate to precision. Experiments with synthetic data show the benefits of evaluating and comparing classifiers under different operating conditions in the proposed F-measure space over ROC, precision-recall, and cost spaces.
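The building block of such a space — the scalar F-measure swept over all decision thresholds of one classifier — can be sketched as below. The weighted-harmonic-mean form with a preference parameter `alpha` is a common formulation (alpha = 0.5 recovers the usual F1); the exact parameterization of the paper's F-measure space is not reproduced here:

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """Weighted harmonic mean of precision and recall; `alpha` encodes
    the relative preference of precision vs. recall (0.5 gives F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

def f_curve(scores, labels, alpha=0.5):
    """Sweep every decision threshold of one classifier and return
    (threshold, F) pairs — one classifier traced as a curve."""
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        points.append((t, f_measure(tp, fp, fn, alpha)))
    return points
```

Repeating the sweep while varying the class skew of the test data (and the preference `alpha`) yields the family of curves that makes up the proposed evaluation space, in the same way cost curves sweep operating conditions for expected cost.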